Every clickstream analysis has to face the problem of identifying users. It’s crucial to connect consecutive page views from the same client, but it’s also important to recognize the same user across different visits. Storing a user ID in a cookie is the standard way of doing it; however, it has some limitations, and in certain situations a different approach may give better results.
Below is a short summary of different techniques that can be helpful in tracking users without cookies (or at least not only with cookies).
- user agent string
- screen resolution
- color depth
- installed plug-ins
- time zone
- supported mime types
If we put all this data together we get an almost unique user fingerprint, as few users share the same hardware/software configuration. However, there are some drawbacks. As more identical devices appear on the market (like the same model of smartphone), more and more clients will have identical configurations, so this approach is less reliable when dealing with mobile users.
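To make the idea of combining these properties concrete, here is a minimal sketch. It assumes the browser has already collected the values and sent them to the server as a dictionary; the function name `fingerprint`, the key names, and the choice of SHA-256 are illustrative assumptions, not part of any particular tracking library.

```python
import hashlib
import json


def fingerprint(attributes):
    """Reduce a dict of client properties to one stable hex digest.

    Hypothetical helper: the attribute names are whatever the
    collecting script chooses to report.
    """
    # Sort keys so the same configuration always serializes identically,
    # regardless of the order in which properties were collected.
    canonical = json.dumps(attributes, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


client = {
    "user_agent": "ExampleBrowser/1.0",  # made-up value for illustration
    "screen": "1920x1080",
    "color_depth": 24,
    "plugins": ["pdf-viewer"],
    "timezone_offset": -60,
    "mime_types": ["application/pdf"],
}
print(fingerprint(client))
```

Two clients with identical configurations produce the same digest, which is exactly the limitation with identical mobile devices mentioned above.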
Here is an example of this method implemented.
- on the first element we define font-family: <font_to_be_tested>, <reference_font>
- and on the other: font-family: <reference_font>
where the reference font is any font we are sure to find on the client system. It works best if that font looks different (has a different width) than the font we are testing. The idea is that after rendering those two elements we measure their widths and check whether they are equal. If the font we are checking is not available, the first HTML element falls back to the reference font, resulting in the same width as the other element.
As far as I remember, this method was used to estimate the percentage of users who have MS Office or Open Office installed. Each office suite comes with its own set of fonts, so it’s easy to check whether a client has any of them installed. Unfortunately, I cannot find any links to that research right now.
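The width comparison at the heart of this technique can be sketched as follows. The rendering and measuring have to happen in the browser; this sketch assumes the measured widths have been reported back as numbers. The function name `installed_fonts`, the `tolerance` parameter, and the sample widths are hypothetical.

```python
def installed_fonts(measured_widths, reference_width, tolerance=0.5):
    """Return the fonts whose test element rendered at a different
    width than the reference-only element.

    measured_widths maps a font name to the width (in pixels) of the
    element styled with "font-family: <font>, <reference_font>".
    """
    # Equal width means the browser fell back to the reference font,
    # so the tested font is treated as not installed.
    return sorted(
        name
        for name, width in measured_widths.items()
        if abs(width - reference_width) > tolerance
    )


# Made-up measurements: the last font fell back to the reference font.
widths = {"Calibri": 182.0, "Cambria": 201.5, "NoSuchFont": 214.0}
print(installed_fonts(widths, reference_width=214.0))
```

A font that happens to render at the same width as the reference font would be missed, which is why a reference font with a clearly different width works best.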
HTTP entity tags (ETags) are a mechanism used for cache validation. When a client requests a resource, the server may add an ETag value (a version of that resource or a checksum) to the response headers. The next time the client requests this resource, it should include this ETag in the request (in the If-None-Match header), so the server can answer with “Not Modified” or send back a new version if the resource has changed.
This can also be used for user tracking. Imagine that the ETag stores not a resource version but a user ID. Each time the client requests the resource, this ID is sent in the headers, just like a cookie. Here you can find another example and detailed description.
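As a rough illustration of the ETag trick, the sketch below models the server side as a plain function rather than a real web framework. The name `handle_request`, the return shape, and the placeholder body are assumptions made for this example; only the `ETag` and `If-None-Match` headers and the 200/304 status codes come from HTTP itself.

```python
import uuid


def handle_request(request_headers):
    """Return (status, response_headers, body, user_id) for one request
    to the tracked resource. Hypothetical handler, not a framework API.
    """
    # Header names are case-insensitive, so normalize before the lookup.
    headers = {k.lower(): v for k, v in request_headers.items()}
    etag = headers.get("if-none-match")
    if etag:
        # Returning visitor: the browser echoed back the ID we issued.
        user_id = etag.strip('"')
        return 304, {"ETag": etag}, b"", user_id
    # First visit: issue a fresh ID disguised as a resource version.
    user_id = uuid.uuid4().hex
    return 200, {"ETag": f'"{user_id}"'}, b"tracked resource", user_id


status, response_headers, _, first_id = handle_request({})
# The browser caches the resource and revalidates it on the next visit.
status_again, _, _, seen_id = handle_request(
    {"If-None-Match": response_headers["ETag"]}
)
print(status, status_again, first_id == seen_id)
```

Answering with 304 keeps the cached copy, and with it the ID, alive in the browser cache, so the ID survives for as long as the cache entry does.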
This document presents another way of client fingerprinting. Besides some generally known client details (like user agent or screen resolution), we can render some text or graphics on an HTML5 canvas element. The idea is that even if we render the same picture or text, the result won’t be identical on different clients due to differences in operating systems and hardware (the graphics processing unit, or GPU). When used with other client-specific data, it can be very helpful in precise user fingerprinting.
Check your browser
To sum up, cookies are the most common way to identify users, but each client has a lot of features which make it almost unique. Staying anonymous requires much more effort and consideration than just turning off cookies in a web browser. What makes the situation even more complicated, as EFF research points out, is that sophisticated techniques intended to enhance privacy are so rare among users that they can themselves be very helpful in identifying users.