Cookieless user tracking

Every clickstream analysis has to solve the problem of identifying users. It’s crucial to connect consecutive page views from the same client, and it’s just as important to recognize the same user across different visits. Storing a user ID in a cookie is the standard way of doing this; however, it has some limitations, and in certain situations a different approach may give better results.

Below is a short summary of techniques that can help track users without cookies (or at least not only with cookies).

Put data from cookies to JavaScript

First of all, you don’t have to store user-specific data in cookies. An interesting technique was presented here (the author mentions Michał Zalewski’s The Tangled Web as an inspiration). Let’s say you have a separate JavaScript file for user tracking (tracking.js) that you include on your page. The first time a user requests that file, the server embeds a randomly generated number in the value of a variable (userID). The tracking.js file should be cacheable, so that on subsequent requests for the file the server responds with 304 Not Modified and the browser keeps using its cached copy. As a result, all future visits use the same userID. Of course, this requires the cache to be preserved, and it will not work in Private/Incognito mode.
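The server side of this idea can be sketched roughly as follows. This is only an illustration under stated assumptions: the function name serveTrackingScript, the plain-object request and response shapes, and the fixed Last-Modified date are all hypothetical, and the original article may implement it differently.

```javascript
// Hypothetical helper: builds the response for a request to /tracking.js.
// `requestHeaders` is assumed to be a plain object with lower-cased names.
function serveTrackingScript(requestHeaders) {
  // A conditional request means the browser still holds a cached copy,
  // and that copy already contains a userID. Tell it to keep using it.
  if (requestHeaders["if-modified-since"]) {
    return { status: 304, headers: {}, body: "" };
  }

  // First request from this client: embed a fresh random number.
  // Math.random() is enough for a sketch; it is not cryptographically strong.
  const userID = Math.floor(Math.random() * Number.MAX_SAFE_INTEGER);

  return {
    status: 200,
    headers: {
      "Content-Type": "application/javascript",
      // A fixed date: from the server's point of view the script never changes.
      "Last-Modified": "Thu, 01 Jan 2015 00:00:00 GMT",
      // "no-cache" forces revalidation on each use, so the server sees every
      // visit but can still answer 304; "private" keeps shared caches out.
      "Cache-Control": "private, no-cache",
    },
    body: "var userID = " + userID + ";\n",
  };
}
```

The returned object could be written out by any HTTP server; the important part is that a revalidating client never receives a new body, so its userID stays stable for as long as the cache entry survives.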

Client specific data available from JavaScript

JavaScript exposes many properties that describe the client system on which the script is executed. Among others:

  • user agent string
  • screen resolution
  • color depth
  • installed plug-ins
  • time zone
  • language
  • supported mime types

If we put all this data together, we get an almost unique user fingerprint, as few users share exactly the same hardware/software configuration. However, there are some drawbacks. As more identical devices appear on the market (such as the same smartphone model), more and more clients will have identical configurations, so this approach is less reliable when dealing with mobile users.

Here is an example of this method implemented.
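Independently of the linked example, combining the properties listed above into one value can be sketched like this. The helper names (fnv1a, collectProperties, fingerprint) are hypothetical, and the 32-bit FNV-1a hash is just one convenient choice for a sketch, not what any particular library uses. The navigator and screen objects are passed in as arguments so the function can also be tried outside a browser.

```javascript
// 32-bit FNV-1a hash, applied to UTF-16 code units rather than bytes,
// returned as 8 hex digits.
function fnv1a(text) {
  let hash = 0x811c9dc5;
  for (let i = 0; i < text.length; i++) {
    hash ^= text.charCodeAt(i);
    hash = Math.imul(hash, 0x01000193);
  }
  return (hash >>> 0).toString(16).padStart(8, "0");
}

// Reads the properties from the list above. `nav` and `scr` are expected
// to look like the browser's `navigator` and `screen` objects.
function collectProperties(nav, scr) {
  return [
    nav.userAgent,
    scr.width + "x" + scr.height,
    scr.colorDepth,
    Array.from(nav.plugins || [], (plugin) => plugin.name).join(","),
    new Date().getTimezoneOffset(),
    nav.language,
    Array.from(nav.mimeTypes || [], (mime) => mime.type).join(","),
  ];
}

// Joins all properties into one string and hashes it.
function fingerprint(nav, scr) {
  return fnv1a(collectProperties(nav, scr).join("|"));
}

// In a browser this would be called as: fingerprint(navigator, screen)
```

A 32-bit hash is short enough that unrelated clients can collide, so a real deployment would probably use a longer hash; the sketch only shows how the pieces fit together.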

Installed fonts

As far as I know, JavaScript cannot list all installed fonts, but it can check whether a given font is available on the client. The trick is described here. We have to pick two fonts: the font whose presence on the client system we want to test, and a second one that will be used as a reference font. We render the same text in two separate HTML elements. Their CSS styles should be identical except for the font-family property:

  • on the first element we define font-family: <font_to_be_tested>, <reference_font>
  • and on the other: font-family: <reference_font>

where the reference font is any font we are sure to find on the client system. It works best if that font looks different (has a different width) than the font we are testing. The idea is that after rendering the two elements we measure their widths and check whether they are equal. If the font we are checking is not available, the first HTML element falls back to the reference font, resulting in the same width as the other element.
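The width comparison described above might look roughly like this. The function names, the sample text, and the inline styles are assumptions made for illustration; the measuring function is passed as a parameter so the comparison logic can be exercised without a browser.

```javascript
// Browser-only helper: renders `text` in a hidden <span> with the given
// font-family value and returns the rendered width in pixels.
function measureTextWidth(fontFamily, text) {
  const span = document.createElement("span");
  span.style.cssText =
    "position:absolute;visibility:hidden;font-size:72px;white-space:nowrap";
  span.style.fontFamily = fontFamily;
  span.textContent = text;
  document.body.appendChild(span);
  const width = span.offsetWidth;
  document.body.removeChild(span);
  return width;
}

// Returns true when text set in `font` (with `referenceFont` as fallback)
// renders with a different width than text set in `referenceFont` alone.
function isFontAvailable(font, referenceFont, measure = measureTextWidth) {
  const text = "mmmmmmmmmmlli"; // wide and narrow glyphs exaggerate differences
  const testedWidth = measure('"' + font + '", ' + referenceFont, text);
  const referenceWidth = measure(referenceFont, text);
  return testedWidth !== referenceWidth;
}

// In a browser: isFontAvailable("Calibri", "monospace")
```

Note the limitation that follows from the technique itself: if the tested font happens to produce exactly the same width as the reference font, the check reports it as missing.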

As far as I remember, this method was used to estimate the percentage of users who have MS Office or Open Office installed. Each office suite comes with its own set of fonts, so it’s easy to check whether a client has any of them installed. Unfortunately, I cannot find any links to that research right now.

HTTP ETag

HTTP entity tags (ETags) are a mechanism used for cache validation. When a client requests a resource, the server may add an ETag value (a version of that resource or a checksum) to the response headers. The next time the client requests this resource, it should include that ETag in the request headers (in If-None-Match), so the server can answer with “Not Modified” or send back a new version if the resource has changed.

This can also be used for user tracking. Imagine that the ETag stores not a resource version but a user ID. Each time the client requests the resource, this ID is sent in the headers, just like a cookie. Here you can find another example and detailed description.
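A minimal sketch of the server logic, assuming a hypothetical handler called handleTrackedResource that receives lower-cased request headers as a plain object. It deliberately ignores details such as weak validators (W/ prefixes) and multiple values in If-None-Match.

```javascript
// Hypothetical handler for a resource whose ETag doubles as a user ID.
function handleTrackedResource(requestHeaders) {
  const echoedTag = requestHeaders["if-none-match"];

  if (echoedTag) {
    // Returning client: the "ETag" it echoes back is really its user ID.
    // Answer 304 with the same tag so the browser keeps sending it.
    return {
      status: 304,
      userID: echoedTag.replace(/"/g, ""),
      headers: { ETag: echoedTag },
      body: "",
    };
  }

  // New client: generate an ID and hand it out as the ETag.
  // Math.random() is enough for a sketch; it is not cryptographically strong.
  const userID = Math.floor(Math.random() * Number.MAX_SAFE_INTEGER).toString(36);
  return {
    status: 200,
    userID: userID,
    headers: {
      ETag: '"' + userID + '"', // ETag values are quoted strings
      "Cache-Control": "private, no-cache", // revalidate on every use
    },
    body: "/* tracked resource */",
  };
}
```

As with the cached script, the identifier lives in the browser cache, so clearing the cache resets it.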

Canvas fingerprinting

This doc presents another way of client fingerprinting. Besides some generally known client details (like the user agent or screen resolution), we can render some text or graphics on an HTML5 canvas element. The idea is that even if we render the same picture or text, the result won’t be identical on different clients, due to differences in operating systems and hardware (the graphics processing unit). Combined with other client-specific data, it can be very helpful in precise user fingerprinting.
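The rendering step could be sketched as below. The function name canvasFingerprint, the drawn text, the colors, and the simple rolling hash are all assumptions for illustration, not the method from the referenced doc. The document object is passed in as a parameter; in a browser it would be the global document.

```javascript
// Draws fixed text and a rectangle on a canvas, serializes the pixels, and
// hashes the result. Rendering differences between systems change the hash.
function canvasFingerprint(doc) {
  const canvas = doc.createElement("canvas");
  canvas.width = 240;
  canvas.height = 60;

  const ctx = canvas.getContext("2d");
  ctx.textBaseline = "top";
  ctx.font = "16px Arial";
  ctx.fillStyle = "#f60";
  ctx.fillRect(100, 5, 80, 30);
  ctx.fillStyle = "#069";
  ctx.fillText("Cwm fjordbank glyphs vext quiz", 2, 15);

  // toDataURL() returns the rendered image as a base64-encoded PNG string.
  const pixels = canvas.toDataURL();

  // Simple 32-bit rolling hash of that string (an arbitrary choice).
  let hash = 0;
  for (let i = 0; i < pixels.length; i++) {
    hash = (Math.imul(hash, 31) + pixels.charCodeAt(i)) | 0;
  }
  return (hash >>> 0).toString(16);
}

// In a browser: canvasFingerprint(document)
```

On its own this value only separates groups of similar machines, which is why the text suggests combining it with the other client properties.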

Check your browser

You can have a look at this site to get fingerprinted. I have put there several properties available from JavaScript and a combined client hash computed by https://github.com/Valve/fingerprintjs.

To sum up, cookies are the most common way to identify users, but each client has many features that make it almost unique. Staying anonymous requires much more effort and consideration than just turning off cookies in a web browser. To make the situation even more complicated, as EFF research points out, sophisticated techniques intended to enhance privacy are so rare among users that they can themselves be very helpful in identifying users.