Reference rot (also called link rot) occurs when hyperlinks, over time, stop pointing to the file, web page, or server they originally targeted, because that resource has been moved to a new address or has become permanently unavailable.
Tod Beardsley from the CVE Board gave a talk at the 2023 CVE Global Summit called ‘Link Rot: The Problem and Archiving for Posterity’ on the problem of reference rot. Sadly, I was not able to attend, but I was interested in how bad the problem was, so this past week I wrote some code to find out.
First, I collected the CVE data from the NVD data feeds and parsed all the references.
Here are the high-level numbers:
Reference Links: 797,474
Average References Per CVE: 4
CVE-2014-0224, an OpenSSL vulnerability, has the most reference links with 303(!?) unique references.
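As a rough illustration of the collection step, here is a minimal sketch of pulling unique reference URLs out of a feed and computing the same kind of summary numbers. It assumes the legacy NVD JSON 1.1 feed layout (CVE_Items, CVE_data_meta, reference_data); the post does not say which feed format was used, so treat those field names as an assumption. The function names, the sample feed, and the CVE IDs in it are hypothetical placeholders.

```python
def extract_references(feed):
    """Map each CVE ID to its set of unique reference URLs.

    Assumes the NVD JSON 1.1 feed layout; adjust the keys for other formats.
    """
    refs = {}
    for item in feed.get("CVE_Items", []):
        cve = item["cve"]
        cve_id = cve["CVE_data_meta"]["ID"]
        # A set drops duplicate URLs listed more than once for the same CVE.
        refs[cve_id] = {r["url"] for r in cve["references"]["reference_data"]}
    return refs


def summarize(refs):
    """Return total links, average links per CVE, and the CVE with the most."""
    total = sum(len(urls) for urls in refs.values())
    average = total / len(refs) if refs else 0.0
    most = max(refs, key=lambda cve_id: len(refs[cve_id]), default=None)
    return {"total": total, "average": average, "most_referenced": most}


# Hypothetical, hand-made sample; a real feed would be loaded with json.load().
SAMPLE_FEED = {
    "CVE_Items": [
        {"cve": {
            "CVE_data_meta": {"ID": "CVE-0000-0001"},
            "references": {"reference_data": [
                {"url": "http://example.com/advisory/1"},
                {"url": "http://example.com/advisory/1"},
                {"url": "http://example.org/bulletin"},
            ]},
        }},
        {"cve": {
            "CVE_data_meta": {"ID": "CVE-0000-0002"},
            "references": {"reference_data": [
                {"url": "http://example.net/note"},
            ]},
        }},
    ]
}

if __name__ == "__main__":
    print(summarize(extract_references(SAMPLE_FEED)))
```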
Finding Reference Rot
I knew I could not hand-check nearly 800,000 links, so the logical first step was to see how many of the 13,991 unique domains responded to DNS requests, using PyNslookup and Google’s Public DNS [8.8.8.8].
It took over thirty minutes to run the 13,991 DNS lookups, and when they finished, 1,520 of the listed hosts had not responded.
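The DNS step could look something like the sketch below. Note that this is not the code used for the numbers above: instead of the nslookup library and Google's Public DNS, it swaps in the standard library's socket.getaddrinfo, which asks whatever resolver the operating system is configured to use. The thread pool is also my addition, as one possible way to cut down the runtime; all function names here are hypothetical.

```python
import socket
from concurrent.futures import ThreadPoolExecutor
from urllib.parse import urlsplit


def unique_hosts(urls):
    """Collect the distinct hostnames that appear in a list of URLs."""
    hosts = set()
    for url in urls:
        try:
            host = urlsplit(url).hostname  # already lowercased by urlsplit
        except ValueError:
            continue  # skip malformed URLs
        if host:
            hosts.add(host)
    return hosts


def resolves(host):
    """Return True if the system resolver finds any address for host."""
    try:
        socket.getaddrinfo(host, None)
        return True
    except (socket.gaierror, UnicodeError):
        return False


def find_dead_hosts(hosts, resolver=resolves, workers=32):
    """Return the sorted list of hosts for which resolver reports failure."""
    ordered = sorted(hosts)
    # DNS lookups are mostly waiting on the network, so threads overlap well.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = list(pool.map(resolver, ordered))
    return [host for host, ok in zip(ordered, results) if not ok]
```

The resolver parameter makes it easy to plug in a different lookup method, or a fake one for testing, without touching the rest of the code.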
How Bad Is Reference Rot?
One domain, SecurityFocus.com, is responsible for 82,854 broken links, over 75% of the total. The top 5 domains account for over 90% of all broken links.
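For anyone wanting to reproduce this kind of breakdown, here is a small sketch of counting broken links per domain and working out what share the top domains hold. It assumes you already have the full list of reference URLs and the set of hosts that failed the DNS check; the function names and the example hosts in the test data are hypothetical.

```python
from collections import Counter
from urllib.parse import urlsplit


def broken_links_by_host(urls, dead_hosts):
    """Count how many reference URLs point at each host that failed DNS."""
    dead = set(dead_hosts)
    counts = Counter()
    for url in urls:
        try:
            host = urlsplit(url).hostname
        except ValueError:
            continue  # skip malformed URLs
        if host in dead:
            counts[host] += 1
    return counts


def top_share(counts, n):
    """Fraction of all broken links that belong to the n worst hosts."""
    total = sum(counts.values())
    if total == 0:
        return 0.0
    top = sum(count for _, count in counts.most_common(n))
    return top / total
```

Calling top_share(counts, 1) gives the share held by the single worst domain, and top_share(counts, 5) gives the share of the top 5.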
What Can Be Done?
One option is to point the dead links at the Wayback Machine at archive.org or a similar archive, but someone needs to settle on an approach before we lose a large amount of valuable vulnerability research.
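To make the Wayback Machine idea a bit more concrete, here is a minimal sketch that builds an archive URL for a dead link, plus the query URL for the Internet Archive's availability endpoint. It only constructs strings and makes no network calls, and it cannot tell you whether a snapshot actually exists for a given page. The function names are hypothetical, and the timestamp is whatever date you want the closest snapshot to (for example, the year the CVE was published).

```python
from urllib.parse import urlencode

WAYBACK_PREFIX = "https://web.archive.org/web/"
AVAILABILITY_ENDPOINT = "https://archive.org/wayback/available"


def wayback_url(original_url, timestamp=""):
    """Build a Wayback Machine URL for original_url.

    timestamp is a YYYYMMDDhhmmss string or a prefix of one, such as "2015".
    Whether a snapshot exists is not checked here.
    """
    if timestamp:
        return f"{WAYBACK_PREFIX}{timestamp}/{original_url}"
    return f"{WAYBACK_PREFIX}{original_url}"


def availability_query(original_url, timestamp=""):
    """Build the availability lookup URL that reports the closest snapshot."""
    params = {"url": original_url}
    if timestamp:
        params["timestamp"] = timestamp
    return f"{AVAILABILITY_ENDPOINT}?{urlencode(params)}"
```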
Checking DNS records is probably the simplest way to check links, and it is not foolproof: a domain can still resolve while the page it once hosted is long gone. To get a complete picture of the broken links, you would need to pull the web pages and do some form of content verification, which was beyond this project’s scope.
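A next step beyond DNS might look roughly like the sketch below, which requests each URL and sorts the result into a few coarse buckets. This was not part of the project, and it is only a starting point: a page that answers with 200 can still be a parked domain or a "not found" page in disguise, so real verification would need to look at the content too. The function names and bucket labels are hypothetical.

```python
import http.client
import urllib.error
import urllib.request


def fetch_status(url, timeout=10.0):
    """Return the final HTTP status for url, or None if nothing answered."""
    request = urllib.request.Request(url, method="HEAD")
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as err:
        return err.code  # the server answered, but with an error status
    except (OSError, ValueError, http.client.HTTPException):
        return None  # DNS failure, refused connection, timeout, bad URL


def classify(status):
    """Sort a status code (or None) into a coarse bucket."""
    if status is None:
        return "unreachable"
    if 200 <= status < 400:
        return "ok"
    if status in (404, 410):
        return "gone"
    return "error"


def check_links(urls, fetch=fetch_status):
    """Map each URL to its bucket; fetch can be replaced for testing."""
    return {url: classify(fetch(url)) for url in urls}
```

Some servers reject HEAD requests, so a fallback to GET may be needed in practice.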