Data sets of the top 1 million Internet sites are simply compiled lists of web sites (or domains) that are found to have the most traffic. What follows are some of the most popular and well-known top 1 million site data sets.
Depending on the methodology used, the results can have significant variability. However, having a reasonably accurate list is beneficial to the many use cases these lists can be applied to.
Alexa Top 1 Million
Established way back in 1996, Alexa had a popular toolbar add-on for web browsers. Using the data collected by the toolbar, Alexa developed a top sites list and made it available via a web application.
The Alexa list, while primarily aimed towards marketers, was used for many research projects. It was reasonably accurate, easily accessible, and became the most well-known resource.
Alexa also offered a Top 1 Million list in CSV format that could be downloaded for free. This was an excellent resource and it found many use cases.
Alexa is now owned by Amazon, which has recently restricted access to the top 1 million list to paying customers. For a time, a list was available at http://s3.amazonaws.com/alexa-static/top-1m.csv.zip; however, it appears to be incomplete and no longer updated.
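Working with a CSV list like this is straightforward. The following is a minimal sketch, assuming each row has the form `rank,domain` with no quoting; the function name `parse_top_sites` and the sample rows are illustrative placeholders, not real ranking data.

```python
import csv
import io


def parse_top_sites(csv_text, limit=None):
    """Parse 'rank,domain' rows into a list of (rank, domain) tuples.

    Assumes a two-column layout; rows that do not match are skipped.
    """
    sites = []
    for row in csv.reader(io.StringIO(csv_text)):
        if len(row) != 2:
            continue  # skip blank or malformed rows
        rank_text, domain = row
        if not rank_text.strip().isdigit():
            continue  # skip a header row, if one is present
        sites.append((int(rank_text), domain.strip().lower()))
        if limit is not None and len(sites) >= limit:
            break
    return sites


# Placeholder rows for illustration only (not real rankings).
sample = "1,example.com\n2,example.org\n3,example.net\n"
print(parse_top_sites(sample, limit=2))
# prints [(1, 'example.com'), (2, 'example.org')]
```

For a real download, the text would first be read out of the zip archive, for example with Python's `zipfile` module, before being passed to the function.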
The Cisco Umbrella list is quite different. It still covers the top 1 million most popular sites, but it is put together from Cisco's visibility into DNS traffic. Rather than measuring which sites are browsed to most, it measures which host names are resolved most often in DNS.
Because it is based on popular DNS requests, the list contains domains that are not in the Alexa list, such as subdomains of primary sites that host other web resources (js / css / images) and even tracking domains used by analytics packages.
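To see how a DNS-based list differs from a site-based one, the host names can be sorted into three groups: those that also appear in the site list, those that are subdomains of a listed site, and those with no listed parent at all. This is a rough sketch; the function name `classify_hostnames` and the sample host names are hypothetical, and matching on parent suffixes is a simplification that ignores public-suffix rules.

```python
def classify_hostnames(dns_hostnames, site_domains):
    """Group DNS host names by their relation to a list of site domains."""
    site_set = {d.lower() for d in site_domains}
    result = {"listed": [], "subdomain_of_listed": [], "unlisted": []}
    for host in dns_hostnames:
        host = host.lower().rstrip(".")  # normalise case and trailing dot
        if host in site_set:
            result["listed"].append(host)
            continue
        labels = host.split(".")
        # Candidate parents of a.b.example.com: b.example.com, example.com, com
        parents = (".".join(labels[i:]) for i in range(1, len(labels)))
        if any(parent in site_set for parent in parents):
            result["subdomain_of_listed"].append(host)
        else:
            result["unlisted"].append(host)
    return result


# Hypothetical host names for illustration.
groups = classify_hostnames(
    ["example.com", "static.example.com", "tracker.example.net"],
    ["example.com"],
)
print(groups)
```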
The use cases for this list tend towards security and network monitoring. The security use case is not surprising given that Cisco maintains and compiles the list.
"Although the data source is quite different from Alexa’s, we believe it’s arguably more accurate as it’s not based on only HTTP requests from users with browser additions. The way the ranking is computed is not as simple as the net sum of all DNS queries." -- Cisco Umbrella
Majestic publishes a list daily that is compiled after analysis of web crawls. Sites are ranked based on backlinks, a methodology similar to that used by search engines.
Majestic's primary use case is marketing and SEO.
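As a rough illustration of backlink-based ranking, the sketch below orders domains by how many distinct other domains link to them. This is an assumed simplification for explanation only, not Majestic's actual algorithm; the function name `rank_by_backlinks` and the link pairs are hypothetical.

```python
from collections import defaultdict


def rank_by_backlinks(links):
    """Rank target domains by their number of distinct referring domains.

    `links` is an iterable of (source_domain, target_domain) pairs.
    """
    referrers = defaultdict(set)
    for source, target in links:
        if source != target:  # ignore self-links
            referrers[target].add(source)
    # Most referring domains first; ties broken alphabetically.
    return sorted(referrers, key=lambda d: (-len(referrers[d]), d))


# Hypothetical crawl output for illustration.
links = [
    ("a.example", "c.example"),
    ("b.example", "c.example"),
    ("a.example", "b.example"),
    ("a.example", "c.example"),  # duplicate link, counted once
]
print(rank_by_backlinks(links))
# prints ['c.example', 'b.example']
```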
Aimed at marketers, the data is based on traffic from "Internet Service Providers and Toolbar Providers". For this reason, the data covers only US-based traffic, and updates are provided monthly.
In the past, this was a free resource, but it now requires an account.
A recently minted list, this free-to-download list uses a methodology that combines some of the other top 1 million site lists mentioned above. By combining lists, its authors believe they have produced a more accurate list, and they have even written a paper to explain it.
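One simple way to combine several ranked lists is to give each domain a score of 1/rank in every list where it appears and sum the scores. The sketch below shows that idea under stated assumptions; it is not necessarily the method used by the list's authors, and the function name `combine_rankings` and the sample domains are hypothetical.

```python
from collections import defaultdict


def combine_rankings(rankings):
    """Merge ranked domain lists by summing 1/rank scores per domain.

    `rankings` is a list of lists, each ordered from most to least popular.
    """
    scores = defaultdict(float)
    for ranked_domains in rankings:
        for position, domain in enumerate(ranked_domains, start=1):
            scores[domain] += 1.0 / position
    # Highest combined score first; ties broken alphabetically.
    return sorted(scores, key=lambda d: (-scores[d], d))


# Hypothetical input lists for illustration.
list_a = ["a.example", "b.example", "c.example"]
list_b = ["b.example", "a.example", "d.example"]
list_c = ["b.example"]
print(combine_rankings([list_a, list_b, list_c]))
# prints ['b.example', 'a.example', 'c.example', 'd.example']
```

A domain that ranks highly in several lists rises to the top, while a domain that appears in only one list is pushed down.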
Created by the team over at ripe.net; they published an interesting article comparing Alexa, Cisco Umbrella, Majestic & Quantcast.
As this graphic clearly shows, there is very little overlap between the different lists.
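Similarity between lists can be quantified in several ways. The sketch below uses the Jaccard index (shared domains divided by all domains across both lists), which is an assumed choice of metric rather than the one used in the comparison article; the names `jaccard` and `pairwise_overlap` and the sample data are hypothetical.

```python
from itertools import combinations


def jaccard(first, second):
    """Return the Jaccard index of two domain collections (0.0 to 1.0)."""
    a, b = set(first), set(second)
    if not a and not b:
        return 1.0  # two empty lists are treated as identical
    return len(a & b) / len(a | b)


def pairwise_overlap(lists):
    """Compute the Jaccard index for every pair of named lists."""
    return {
        (name_1, name_2): jaccard(lists[name_1], lists[name_2])
        for name_1, name_2 in combinations(sorted(lists), 2)
    }


# Hypothetical list contents for illustration.
lists = {
    "list_one": ["a.example", "b.example"],
    "list_two": ["b.example", "c.example"],
}
print(pairwise_overlap(lists))
```

A value near 0 means the two lists share almost no domains, and a value of 1 means they contain exactly the same domains. Rank order is ignored by this metric.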
Another marketing-focused site that offers data. Only the top 50 sites are available from the site unless you upgrade to a paid plan.
Moz is a search engine optimization (SEO) service with a large data set of search-related data. Using this, Moz makes the top 500 sites available for free.
Established in 1995, Netcraft is another company that has been around since the early days of the Internet. Its core functions are Internet data analysis and security. Netcraft has extensive data on web hosting across the Internet going back to 1995.
Some of the work performed by Netcraft results in the takedown of phishing sites and other cybercrime-related measures.
Using data from CommonCrawl and CommonSearch, the DomCop project has compiled a list of the top 10 million sites. Better yet, the full site list is available as a free download.