500K HTTP Headers

Recently we crawled the Top 500K sites (as ranked by Alexa). Following requests from readers, we are making the HTTP headers available for research purposes.

Our published statistics on WordPress usage are one example of the research that can be conducted with this data. The headers make it possible to identify web applications, web servers, server-side scripting languages, load balancers and much more.

HTTP Headers that could be examined:

Security Headers

  • HttpOnly flag (Set-Cookie)
  • X-Frame-Options
  • X-XSS-Protection
  • X-Content-Security-Policy

Server Headers

  • Server:
  • X-Powered-By:
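
As a rough illustration of examining the headers listed above, the following Python sketch splits a raw header dump into one block per response and reports which security headers appear in the final response. It assumes each dump holds plain response headers with every response starting on an "HTTP/" status line; the actual layout of the published files may differ, and the function names are hypothetical.

```python
# Assumption: a header dump is plain text, each response beginning with an
# "HTTP/..." status line followed by "Name: value" lines.

SECURITY_HEADERS = ("x-frame-options", "x-xss-protection", "x-content-security-policy")


def parse_header_blocks(raw):
    """Return a list of {lowercased-name: [values]} dicts, one per response."""
    blocks = []
    current = None
    for line in raw.splitlines():
        line = line.strip()
        if line.upper().startswith("HTTP/"):
            # A status line starts a new response (e.g. after a redirect).
            current = {}
            blocks.append(current)
        elif current is not None and ":" in line:
            name, _, value = line.partition(":")
            current.setdefault(name.strip().lower(), []).append(value.strip())
    return blocks


def security_summary(raw):
    """Report which security headers are present in the last response."""
    blocks = parse_header_blocks(raw)
    if not blocks:
        return {}
    final = blocks[-1]
    summary = {name: name in final for name in SECURITY_HEADERS}
    cookies = final.get("set-cookie", [])
    summary["httponly-cookie"] = any("httponly" in c.lower() for c in cookies)
    return summary


# Made-up sample data: a redirect followed by the final response.
sample = (
    "HTTP/1.1 302 Found\r\n"
    "Location: http://www.example.com/\r\n"
    "\r\n"
    "HTTP/1.1 200 OK\r\n"
    "Server: nginx\r\n"
    "X-Frame-Options: SAMEORIGIN\r\n"
    "Set-Cookie: sid=abc; path=/; HttpOnly\r\n"
)
print(security_summary(sample))
```

Only the last response is inspected here because redirects often carry a different set of headers than the page the visitor finally receives; inspecting every block is an equally reasonable choice.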

Recommended Tools for Analysis

A few basic text manipulation tools will make it easier to search through the data. Start with a *nix based system; grep, cut, sed and some simple bash scripting will cover most of what you need. The file contains 5 folders with 100K headers in each. To determine the host name of a site, its headers have to be correlated with the site list file.
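
The correlation step could look something like the Python sketch below. The naming scheme is purely an assumption for illustration: it supposes each header file is named after the site's rank (for example "2.txt") and that the site list holds "rank,host" lines. Adjust both to match the real files; the function names are hypothetical too.

```python
import csv
import io


def load_site_list(text):
    """Map rank (int) to host name from 'rank,host' lines (assumed format)."""
    sites = {}
    for row in csv.reader(io.StringIO(text)):
        if len(row) >= 2 and row[0].strip().isdigit():
            sites[int(row[0])] = row[1].strip()
    return sites


def host_for_header_file(filename, sites):
    """Look up the host for a header file named '<rank>.txt' (hypothetical naming)."""
    stem = filename.rsplit("/", 1)[-1].split(".", 1)[0]
    return sites.get(int(stem)) if stem.isdigit() else None


# Made-up site list using reserved example domains.
sites = load_site_list("1,example.com\n2,example.org\n3,example.net\n")
print(host_for_header_file("headers-1/2.txt", sites))  # prints example.org
```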

When counting sites with grep, be sure to use the -m 1 option. It stops reading a file after the first matching line, so a site that returned multiple sets of headers (for example after an HTTP 302 redirect) is not counted twice.
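
The same "first match only" idea can be sketched in Python for anyone who prefers it to grep. This is a minimal sketch that assumes each site's headers are available as one text dump; the function names are hypothetical. It tallies the first Server value seen per site, so a redirect chain contributes a single count.

```python
from collections import Counter


def first_header_value(raw, name):
    """Return the value of the first 'name:' line in a dump, or None."""
    prefix = name.lower() + ":"
    for line in raw.splitlines():
        if line.lower().startswith(prefix):
            return line[len(prefix):].strip()
    return None


def count_first_values(dumps, name):
    """Count each site once, using only its first matching header line."""
    counts = Counter()
    for raw in dumps:
        value = first_header_value(raw, name)
        if value is not None:
            counts[value] += 1
    return counts


# Made-up dumps: the first site redirects, so it has two Server lines.
dumps = [
    "HTTP/1.1 302 Found\r\nServer: nginx\r\n\r\nHTTP/1.1 200 OK\r\nServer: nginx\r\n",
    "HTTP/1.1 200 OK\r\nServer: Apache\r\n",
    "HTTP/1.1 200 OK\r\nContent-Type: text/html\r\n",
]
print(count_first_values(dumps, "Server"))
```

Here the redirecting site adds one to the nginx tally rather than two, which mirrors what -m 1 achieves on the command line.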

2 Responses to 500K HTTP Headers

  1. souders May 16, 2014 at 3:27 pm #

    The HTTP Archive (http://httparchive.org/) is another source for this information. The world’s top 300K URLs are crawled twice each month. In addition to HTTP Headers, the HTTP Archive has screenshots, loading videos, aggregate stats (use of fonts, size of scripts, etc.), and performance analysis. The HTTP Archive is part of the Internet Archive. All the code and data is open source.

    • hdm May 18, 2014 at 1:38 pm #

      Don't forget https://Scans.IO/ – weekly uploads of every internet-facing web server (150m records), including headers and content.