Some time ago I was playing with the .NET WebClient class to download some sites and process them. It's quite a simple class; it doesn't even return the status code of the download, just the body and the headers sent by the server. It handles redirects as well.
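
Just to illustrate (the url below is only a placeholder), a minimal download looks like this:

$wc   = New-Object System.Net.WebClient
$body = $wc.DownloadString('http://www.example.com')    # just the html body, no status code
$wc.ResponseHeaders                                     # headers sent back by the server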

This week I was just skimming through several blogs and found some broken links. Ok, so why not use the WebClient to find a website's broken links? Below is what I created. It's rather long, so be patient :) Some of the code could maybe be replaced by the [IO.Path] class, but I'd have to check how it works with urls. Ok, let's start...

Just initialize the stuff. There are some regexes that will be needed later.

param([string]$url, 
    [string]$skipFilter,
    [string]$errorContent,
    [int]$debugLevel=-1,
    [string]$logFile,
    [switch]$help)
if ($help) {
    write-host "parametry: -url -skipFilter -errorContent -debugLevel -logFile -help"
    Write-Host "-url - url where to start searching"
    Write-Host "-skipFilter - regex that specifies that links should be skipped"
    Write-Host "-errorContent - some text that is at error page that indicates that something " +
        "went wrong when getting the page"
    Write-Host "-debugLevel - how deep should be the debugging information; " + 
        "-1: no debug, 2: too verbose"
    Write-Host "-logFile - file name; if not specified, the debug info goes to screen;" +
        " if specified, it goes to this file"
    Write-Host "-help - writes this help"
    return;
}
if (!$url) { throw "you have to specify the -url parameter" }
if ($url -notmatch 'https?://.*') {
    $url = "http://$url"
    Write-Host "corrected url to $url"
}

$rLink      = new-object regex '<a\s[^>]*href="(?<path>[^"]+)"[^>]*>', 'Multiline,IgnoreCase'
$rServer    = New-Object regex '^https?://
    ((?<server>localhost/\w+)|  (?# pro localhost)
     ([-\w]+\.)*                 (?# www etc.)
      (?<server>[-\w]+\.[-\w]+)  (?# jmeno serveru)
    )
    ([?/].*|$)                   (?# rest)','IgnorePatternWhiteSpace'
$script:rRoot = new-object regex 'https?://[^/]+'
$script:rDir  = new-object regex '
    ^(?<proto>https?://)(
        (?<path>localhost/[^/]+/?) |       (?# je jednoduse http://localhost/testsite/)
        (?<path>[^/]+/?) |                 (?# je jednoduse http://www.seznam.cz/)
        (?<path>([^/]+/)+)[^/]*            (?# vnoreny, http://www.seznam.cz/abc/x.html)
    )$', `
    'IgnorePatternWhiteSpace,Singleline'
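
A quick check that $rServer picks the server name the way I want (the urls are made up):

$rServer.Replace('http://www.nikdo.cz/o-mne', '${server}')   # -> nikdo.cz
$rServer.Replace('http://nikdo.cz', '${server}')             # -> nikdo.cz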

A function for writing debugging information. The debug output may be quite detailed if you write with level 2 (the highest level). It can also write the info into a file.

# write debug info
function WriteDebug([int]$level, [string]$str) {
    trap [Exception] { 
        Write-Error "Unable to save debug output"
        Write-Error $error
        Write-Error $error[0].Exception.Message
        continue;
    }
    if ($debugLevel -ge $level) {
        if ($logFile) { "debug: $str" | Out-File -FilePath $logfile -Append }
        else          { Write-Host "debug: " $str -ForegroundColor Yellow }
    }
}
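
So e.g. when the script runs with -debugLevel 1, the first line below writes its message and the second one is silent:

WriteDebug 1 "checking the home page"   # written when -debugLevel is 1 or 2
WriteDebug 2 "raw html follows..."      # written only when -debugLevel is 2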

Helper functions for working with urls and for returning an info object.

# wraps the parameters into an object
# url - found url 
# from - where the $url was found (address of the page)
function GetLinkInfo([string]$url, [string]$from) {
    $res      = '' | Select-Object Url,From
    $res.Url  = $url;
    $res.From = $from;
    $res
}
# returns server name
function Server([string]$url)  { $rServer.Replace($url, '${server}') }
# returns true if the url doesn't point outside the web site
function IsLocal([string]$url) { $serverName -eq (Server $url) }
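
A quick check (this assumes $serverName was already set from the starting url, which happens in the main part below):

Server 'http://www.nikdo.cz/o-mne'     # -> nikdo.cz
IsLocal 'http://www.nikdo.cz/fotky'    # -> True if the script started at www.nikdo.cz
IsLocal 'http://www.objects.cz/'       # -> False in that case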

A long function, GetLinks, that gets the links. It takes the source url, examines all the <a href="http://....">go</a> links, picks the content of the href attribute, builds the correct url and returns it.

# takes url address and tries to find all the links found at this site
# (links are all <a href=.. />, e.g. 'http://www.nikdo.cz', '/', 'o-mne' etc.)
# then translates all the links and returns correct ones
# (e.g. 'http://www.nikdo.cz', 'http://www.nikdo.cz/o-mne')
function GetLinks([string]$murl) {
    $res            = ''|Select-Object Url,ContentLen,Error,Links
    $res.Url        = $murl
    $res.Links      = @()
    WriteDebug 0 "----- getting links from $murl"
    $local:content  = $webclient.DownloadString($murl)
    WriteDebug 2 "getting done"
    $res.ContentLen = $local:content.Length
    WriteDebug 1 "content length: $($res.ContentLen)"
    
    trap [Exception] { 
        Write-Error "Unable to download $murl"
        $res.Error = $error[0];
        return $res
    }
    if (!(IsLocal $murl)) {
        WriteDebug 0 "processing external link stoped: $murl"
        return $res
    }
    if ($errorContent -and $local:content -match $errorContent) {
        $res.Error = 'found error content';
        return $res
    }
    
    $local:matches = $rLink.Matches($local:content)
    if (!$local:matches) {
        return $res
    }
    $uniqueLinks = @{}
    foreach($local:match in $local:matches) { 
        $local:l = $local:match.Groups['path'].Value.Replace("&amp;","&");  
        WriteDebug 2 "link: $local:l"
        if ($local:l -match 'javascript:') { WriteDebug 2 "Skipping javascript"; continue; }
        if ($local:l -match '^feed:')      { WriteDebug 2 "Skipping feed:"; continue; }
        if ($local:l -match '^#')          { WriteDebug 2 "Skipping #...:"; continue; }
        if ($local:l -match '^callto://')  { WriteDebug 2 "Skipping callto:"; continue; }
        if ($local:l -match 'mailto:')     { WriteDebug 2 "Skipping mailto:"; continue; }
        
        $newUrl = MakePath $murl $local:l
        if ($skipFilter -and $newUrl -match $skipFilter) {
            WriteDebug 2 "skipping $local:l"
            continue;
        }
        if (!$uniqueLinks[$newUrl]) {
            if (IsLocal $newUrl) { WriteDebug 2 "Found url $newUrl" }
            else                 { WriteDebug 2 "Found external link: $newUrl" }
        }
        else                     { WriteDebug 2 "Skipping link: $newUrl" }
        $uniqueLinks[$newUrl] = $true
    }
    WriteDebug 2 "Count of links at page: $($uniqueLinks.Count)"
    $res.Links = @($uniqueLinks.Keys)
    return $res
}
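
When everything works (this assumes $webclient is already created; see the main part below), the returned object looks roughly like this:

$page = GetLinks 'http://www.nikdo.cz/'
$page.Url          # the address that was downloaded
$page.ContentLen   # length of the downloaded html
$page.Error        # $null when the download went fine
$page.Links        # unique absolute urls found on the page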

Some more helper functions that create the desired url from the base url and the content of the href attribute.

# returns url where to go from $url
# url - original address of the page
# go - link found in href attribute
function MakePath($url, $go) {
    #go = '/moje-poznamky'
    if ($go -match '^/')             { (GetRoot $url) + $go }
    #go = '../home.aspx'
    elseif ($go -match '\.\./')      { (ResolveDir $url $go) }
    #go = 'http://www.gonekam.co'
    elseif ($go -match '^https?://') { $go; }
    #go = 'fotky'
    else                             { (GetDir $url) + $go}
}
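
Some examples of what MakePath should produce (again, the urls are made up):

MakePath 'http://www.nikdo.cz/fotky/index.html' '/o-mne'   # -> http://www.nikdo.cz/o-mne
MakePath 'http://www.nikdo.cz/fotky/' '../home.aspx'       # -> http://www.nikdo.cz/home.aspx
MakePath 'http://www.nikdo.cz/fotky/' 'musov-2008/'        # -> http://www.nikdo.cz/fotky/musov-2008/
MakePath 'http://www.nikdo.cz/' 'http://www.objects.cz/'   # -> http://www.objects.cz/ (returned as is)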

# returns web page root (e.g. http://nikdo.cz/fotky/musov-2008/ -> http://nikdo.cz)
# url - source url
# append - text to append to the result
function GetRoot($url, [string]$append) {    $rRoot.Match($url).Value + $append }

# needed when $go begins with ../ (= go to the parent directory)
# url - original url
# go - link found in href attribute (e.g. '../o-mne')
# example: ResolveDir 'http://www.nikdo.cz/aba/' '../o-mne' -> 'http://www.nikdo.cz/o-mne'
function ResolveDir($url, $go) {
    WriteDebug 2 "Resolving $url / $go"
    if ($go -match '^https?://') {
        WriteDebug 2 "Found https?:// in $go: $(Server $go), returning $go"
        return $go
    }
    while($go -match '\.\./') {
        $go = $go.Remove(0,3);
        $url = $url -replace '(?<r>.*/)[^/]+/[^/]*','${r}'
    }
    $url + $go
}
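
It handles more ../ levels as well:

ResolveDir 'http://www.nikdo.cz/a/b/' '../../o-mne'   # -> http://www.nikdo.cz/o-mne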

# removes file name
# this could be maybe solved by [io.path]
function GetDir($url) {
    $local:m   = $rDir.Match($url);
    $local:ret = $local:m.Groups['proto'].Value + $local:m.Groups['path'].Value
    if ($local:ret -notmatch '/$') {    $local:ret += '/'}
    return $local:ret
}
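
For example:

GetDir 'http://www.nikdo.cz/fotky/musov-2008/img.html'   # -> http://www.nikdo.cz/fotky/musov-2008/
GetDir 'http://www.nikdo.cz'                             # -> http://www.nikdo.cz/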

And the main loop that goes through all the links. A few notes on the variables:

  • $okPaths - contains the valid urls
  • $blindPaths - contains the broken urls, i.e. the ones where $webclient threw an exception
  • $maybePaths - urls still waiting to be checked
  • $found - dictionary holding the already visited urls
  • $res - result object to return

The result object has 2 properties: $res.Ok and $res.Blind. The first one contains the valid links (from $okPaths) and the second one the broken links (from $blindPaths).

$webclient  = New-Object "System.Net.WebClient"
$webclient.Headers.Add('User-Agent', 'get-blindLinks getter; ps1 script; Pepa se vam timto omlouva:)')
$okPaths    = @()   # list of objects with Url,From
$blindPaths = @()    # list of objects with Url,From
$maybePaths = @(GetLinkInfo $url $url) # list of objects with Url,From
$found      = @{$url=$true}

$serverName = Server $url
while($maybePaths)
{
    # pop the first waiting url; multiple assignment splits the array into head and tail
    $currUrl,$maybePaths = $maybePaths
    if (!$maybePaths){$maybePaths=@()}   # the tail is $null when only one item was left
    WriteDebug 1 "Checking $($currUrl.Url)"
    $page                = GetLinks $currUrl.Url
    $found[$currUrl.Url] = $true
    if ($page.Error) {
        $blindPaths += $currUrl
        Write-Host "Blind: $($page.Url)"
    }
    else {
        Write-Host "OK: $($page.Url)"
        $okPaths += $currUrl
        $maybePaths = @($maybePaths)
        WriteDebug 2 "Maybepaths count before: $($maybePaths.Length)"
        foreach($l in $page.Links) { 
            if (!$found[$l]) { 
                WriteDebug 1 "Added new link $l"
                $maybePaths += GetLinkInfo $l $currUrl.Url
                $found[$l]  = $true
            }
        }
        WriteDebug 1 "Maybepaths count after: $($maybePaths.Length)"
    }
}
$res       = ''|Select-Object OK,Blind
$res.Ok    = $okPaths
$res.Blind = $blindPaths
$res
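
By the way, the $currUrl,$maybePaths = $maybePaths line at the top of the loop is just PowerShell's multiple assignment: it pops the first item and keeps the rest. A tiny demo:

$head,$rest = @(1,2,3)
$head   # -> 1
$rest   # -> 2 3
$head,$rest = @(1)
$rest   # -> $null; that's why the loop resets it to @()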

How you can call it:

PS M:\web> $objs = .\get-blindLinks.ps1 http://www.objects.cz -debug 1 -logfi c:\temp\objectsblind.log
And the result? (Ok, in the next update, the property will be called Broken :)
PS M:\web> $objs.Blind | fl

Url  : http://www.objects.cz/en.html
From : http://www.objects.cz

Url  : http://www2.ing.puc.cl/~jnavon/IIC2142/patexamples.htm
From : http://www.objects.cz/clanky/clanek11/clanek11.asp

Url  : http://www.objects.cz/produkty/OM2005/OM2005_1.html
From : http://www.objects.cz/clanky/clanek15/clanek15.html

Url  : http://www.agit.cz/
From : http://www.objects.cz/clanky/clanek8/clanek8.asp

Url  : http://www.objects.cz/pobyt/pobytovaskoladesignpatterns.asp
From : http://www.objects.cz/clanky/clanek10/clanek10.asp

Meta: 2008-07-31, Pepa

Tags: PowerShell web