Exploring Cisco’s Top 1 Million Domains Data

Cisco offers a daily list of the million most-queried domain names from Umbrella (OpenDNS) users. I had some free time this weekend, so I decided to play around with the data and see what I could find; I spun up a Lightsail server and got to work.
Grabbing the file is as simple as:
wget http://s3-us-west-1.amazonaws.com/umbrella-static/top-1m.csv.zip
You can retrieve a specific date like this:
wget http://s3-us-west-1.amazonaws.com/umbrella-static/top-1m-yyyy-mm-dd.csv.zip
(Looks like 2017-01-20 is the earliest they have online).
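If you want a batch of historical lists, a quick loop over the dated URLs works. A minimal sketch, assuming GNU date:

# grab the last seven days of lists
for i in $(seq 1 7); do
  wget http://s3-us-west-1.amazonaws.com/umbrella-static/top-1m-$(date -d "-$i day" +%F).csv.zip
done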
Once you get that downloaded and unzipped (unzip top-1m.csv.zip), you can start exploring.
You can pull out the top 10 domains with this command:
head -n 10 top-1m.csv

1,google.com
2,www.google.com
3,microsoft.com
4,facebook.com
5,doubleclick.net
6,g.doubleclick.net
7,clients4.google.com
8,googleads.g.doubleclick.net
9,apple.com
10,fbcdn.net

(Full Output)

You can search for keywords with this command:
grep "opendns" top-1m.csv

437,opendns.com
719,hydra.opendns.com
720,sync.hydra.opendns.com
1314,disthost.opendns.com
2756,api.opendns.com
4565,cacerts.opendns.com
5569,ipf.opendns.com
5699,block.opendns.com
7024,updates.opendns.com
8482,bpb.opendns.com

(Full Output)

To count the domain levels, use this command (the first column of the output is the number of dot-separated labels, the second is how many domains have that many):
awk -F, '{count=split($2,a,"."); print count}' top-1m.csv | sort | uniq -c | awk '{print $2,$1}' | sort -k1,1n

1 1086
2 263509
3 469756
4 193802
5 54281
6 13698
7 2952
8 689
9 172
10 16
11 26
12 2
13 1
14 1
15 1
16 1
17 1
18 1
19 1
20 1
21 1
22 1
23 1

(Full Output)
Notice anything strange here? Hint: A domain name requires at least two levels to be valid.

To find the broken DNS names in this list (single-label entries that can't be valid public domain names on their own), this command works:
awk -F, 'split($2,a,".") == 1' top-1m.csv

1200,home
1490,local
2082,za
3916,lan
6350,url
10173,belkin
10869,uop
11187,localdomain
12887,localhost

(Full Output)
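Some of these single-label names are not actually broken: za, for example, is a real country-code TLD that happens to receive bare queries. To separate genuine TLDs from leaked local names like home and lan, you can cross-check against IANA's TLD list. A sketch of that check, assuming you download the list first:

wget https://data.iana.org/TLD/tlds-alpha-by-domain.txt
# print single-label names that are not in the IANA TLD list
awk -F, 'NR==FNR {if ($0 !~ /^#/) tld[tolower($0)]=1; next} split($2,a,".")==1 && !(a[1] in tld)' tlds-alpha-by-domain.txt top-1m.csv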

Find domains added to the list for today.
I wrote a script to download the last two days of files and compare them for new domains:
https://gist.github.com/jgamblin/184590e2ba64371730e435ab2977e4cf
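If you just want the core of the idea without the gist, it boils down to a comm(1) over the two days' domain columns. A minimal sketch (the file names are mine, and it assumes GNU date):

wget -q http://s3-us-west-1.amazonaws.com/umbrella-static/top-1m.csv.zip -O today.zip
wget -q http://s3-us-west-1.amazonaws.com/umbrella-static/top-1m-$(date -d yesterday +%F).csv.zip -O yesterday.zip
unzip -p today.zip | cut -d, -f2 | sort > today.txt
unzip -p yesterday.zip | cut -d, -f2 | sort > yesterday.txt
# lines only in today's file are the new domains
comm -13 yesterday.txt today.txt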

You can find the output for April 24, 2017 here.

Overall I am really impressed with this data and will be using it to do more research and to track trends across the internet. Cisco still has some cleanup to do (those single-label names, for one), but it is an amazingly valuable free tool.
Also, recently I have fallen in love with sprunge for pushing data to an ad-free “pastebin” from the command line:

cat file.txt | curl -F 'sprunge=<-' http://sprunge.us
