Using Katelyn Crawler to Find All Domain References

You can use the Katelyn Crawler to crawl a website looking for references to a particular domain. The example below reports not just each instance of the domain, but the full URL that was found. Treat it as a starting point, though, and don’t limit your imagination.

You could search for HTTP references that should be HTTPS, for links to a domain that you have retired, or for links that use an old scheme that you have replaced with a new one.

Find Links by Domain

The following regular expression can be placed in the “Search Exp” field in the Katelyn UI, and will find all fully qualified references to www.example.com. The matches are output as errors and include the full URL that was found, for example “At 1356 – https://www.example.com/images/photo.jpg”. The results are organised by the page they were found on, making it super easy to track down each reference.

(?:http:\/\/?|https:\/\/?)(www\.example\.com\/.*?)"
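If you want to see what this expression picks up before pointing Katelyn at a site, here is a rough Python sketch. It is only a stand-in: Katelyn uses the C# regex syntax, Python's `re` module is assumed to behave the same way for a pattern this simple, and the sample HTML and the `find_references` helper are invented for illustration.

```python
import re

# The expression from above, unchanged. Python's re module is only a stand-in
# for the C# engine here; for a pattern this simple they should agree.
PATTERN = re.compile(r'(?:http:\/\/?|https:\/\/?)(www\.example\.com\/.*?)"')

# Hypothetical page source, invented for illustration.
SAMPLE_HTML = (
    '<a href="https://www.example.com/about">About</a>\n'
    '<img src="http://www.example.com/images/photo.jpg">\n'
    '<a href="https://other.example.org/">Elsewhere</a>\n'
    '<a href="https://www.example.com">Home</a>\n'
)

def find_references(html):
    """Return (offset, url) pairs; the closing quote is dropped from the URL."""
    return [(m.start(), m.group(0)[:-1]) for m in PATTERN.finditer(html)]

for offset, url in find_references(SAMPLE_HTML):
    print(f"At {offset} - {url}")
```

In this sample only the first two links are reported. The last one is skipped because the expression requires a slash after the domain, so a bare “https://www.example.com” will not match. The expression also ends on a double quote, so single-quoted attributes are not picked up either.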

You could also completely ignore the http / https part (for example if you were using scheme-relative links that start with “//”):

(www\.example\.com\/.*?)"

And you could make further adjustments if you didn’t care about the “www” subdomain in particular, or if you wanted to check some other subdomain.
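As a sketch of what those adjustments might look like, here are two assumed variants, checked with Python's `re` module as a stand-in for the C# engine. The “cdn” subdomain, the sample HTML, and the `matches` helper are all hypothetical.

```python
import re

# Assumed variant 1: "(?:www\.)?" makes the "www." prefix optional.
OPTIONAL_WWW = r'((?:www\.)?example\.com\/.*?)"'

# Assumed variant 2: look for a different subdomain ("cdn" is hypothetical).
OTHER_SUBDOMAIN = r'(cdn\.example\.com\/.*?)"'

# Hypothetical page source, invented for illustration.
SAMPLE_HTML = (
    '<a href="https://www.example.com/a">A</a>\n'
    '<a href="//example.com/b">B</a>\n'
    '<img src="https://cdn.example.com/c.png">\n'
)

def matches(pattern, html):
    """Return the captured text of every match, in document order."""
    return re.findall(pattern, html)

print(matches(OPTIONAL_WWW, SAMPLE_HTML))
print(matches(OTHER_SUBDOMAIN, SAMPLE_HTML))
```

Note that the optional-www variant is not anchored to the start of the hostname, so it also matches the “example.com/...” part of the cdn link, and it would match inside an unrelated hostname that merely ends in “example.com”. Whether that is acceptable depends on what you are hunting for.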

Highly Flexible

The Search Expression input is highly flexible: if you can write (or find) a regular expression that uses the C# syntax, you can ask Katelyn to find things for you. Because Katelyn is a slow crawler, it won’t destroy your server in the process. You can use an online regex tester to try out your changes before running a crawl.
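If you would rather try a change locally than in an online tester, a few lines of Python will do as a rough check. This sketch assumes an adapted expression that reports only plain-HTTP references; the expression, the test snippets, and the `check` helper are all illustrative, and Python's `re` module is again only a stand-in for the C# engine.

```python
import re

# Assumed adaptation of the domain expression: report only plain-HTTP references.
HTTP_ONLY = r'(http:\/\/www\.example\.com\/.*?)"'

# Hypothetical snippets paired with whether the expression is expected to match.
CASES = [
    ('<a href="http://www.example.com/old-page">', True),
    ('<a href="https://www.example.com/new-page">', False),
    ('<script src="//www.example.com/app.js">', False),
]

def check(pattern, cases):
    """Return the snippets whose actual result differs from the expectation."""
    compiled = re.compile(pattern)
    return [text for text, expected in cases
            if (compiled.search(text) is not None) != expected]

failures = check(HTTP_ONLY, CASES)
print("all cases behave as expected" if not failures else failures)
```

Listing a few snippets that should not match is as useful as listing ones that should, since an expression that is too greedy produces noise across every crawled page.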